Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit session state during metadata queries in Iceberg #19757

Merged
merged 4 commits into from
Nov 16, 2023

Conversation

findepi
Copy link
Member

@findepi findepi commented Nov 15, 2023

Metadata queries such as information_schema.columns,
system.jdbc.columns or system.metadata.table_comments may end up
loading arbitrary number of relations within single query (transaction).
It is important to bound memory usage for such queries.

In case of Iceberg Hive metastore based catalog, this is already done in
TrinoHiveCatalogFactory bu means of configuring per-query
CachingHiveMetastore. However, catalogs with explicit caching need
something similar.

@cla-bot cla-bot bot added the cla-signed label Nov 15, 2023
@github-actions github-actions bot added the iceberg Iceberg connector label Nov 15, 2023
@findepi findepi force-pushed the findepi/iceberg-oom-limit branch from a1ab9d5 to 2c3e111 Compare November 15, 2023 21:23
Metadata queries such as `information_schema.columns`,
`system.jdbc.columns` or `system.metadata.table_comments` may end up
loading arbitrary number of relations within single query (transaction).
It is important to bound memory usage for such queries.

In case of Iceberg Hive metastore based catalog, this is already done in
`TrinoHiveCatalogFactory` bu means of configuring per-query
`CachingHiveMetastore`. However, catalogs with explicit caching need
something similar.
@findepi findepi force-pushed the findepi/iceberg-oom-limit branch from 2c3e111 to ad51682 Compare November 15, 2023 21:40
@@ -689,6 +690,7 @@ public void dropCorruptedTable(ConnectorSession session, SchemaTableName schemaT
}
String tableLocation = metadataLocation.replaceFirst("/metadata/[^/]*$", "");
deleteTableDirectory(fileSystemFactory.create(session), schemaTableName, tableLocation);
invalidateTableCache(schemaTableName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be moved to dropTableFromMetastore or even to deleteTable to simplify code and prevent from omitting in case new functions in future?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought about it. technically it would work, but i considered dropTableFromMetastore being just a technical operation, which may or may not be invoked, or be the last operation as part of the drop flow

@@ -365,6 +365,7 @@ private static Optional<String> getQueryId(io.trino.plugin.hive.metastore.Table
public void unregisterTable(ConnectorSession session, SchemaTableName schemaTableName)
{
dropTableFromMetastore(schemaTableName);
invalidateTableCache(schemaTableName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this and folowing be moved to dropTableFromMetastore ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi
Copy link
Member Author

findepi commented Nov 16, 2023

CI #16315 (#8920), #15187
retrying

@findepi findepi merged commit 8d4f26b into master Nov 16, 2023
96 of 97 checks passed
@findepi findepi deleted the findepi/iceberg-oom-limit branch November 16, 2023 15:25
@github-actions github-actions bot added this to the 434 milestone Nov 16, 2023
@mosabua
Copy link
Member

mosabua commented Nov 17, 2023

No release note entry @findepi ?

@findepi findepi added the no-release-notes This pull request does not require release notes entry label Nov 17, 2023
@alexjo2144
Copy link
Member

My understanding is that the caching here is important for ensuring that queries use the same snapshot in different phases of query execution. It's unlikely that a regular select would fill this cache but it'd be nice if we could have them be unbounded when necessary.

@findepi
Copy link
Member Author

findepi commented Nov 17, 2023

I see your point and agree.
I allowed myself to be a bit lazy here.
The pattern exists in Hive connector since long time. It's configurable there (hive.per-transaction-metastore-cache-maximum-size) but i don't recall people needing to configure this. Of course I agree it would be better to have a configuration toggle for this.

We probably should also update the JDBC connector. DefaultJdbcMetadataFactory creates CachingJdbcClient with unbounded cache, so it can OOM for large metadata queries.
cc @hashhar @kokosing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed iceberg Iceberg connector no-release-notes This pull request does not require release notes entry
Development

Successfully merging this pull request may close these issues.

4 participants